ABSTRACT

The TDR Targets Database (http://tdrtargets.org) 
has been designed and developed as an online 
resource to facilitate the rapid identification and 
prioritization of molecular targets for drug develop- 
ment, focusing on pathogens responsible for 
neglected human diseases. The database integrates 
pathogen specific genomic information with func- 
tional data (e.g. expression, phylogeny, essentiality) 
for genes collected from various sources, including 
literature curation. This information can be browsed 
and queried using an extensive web interface with 
functionalities for combining, saving, exporting 
and sharing the query results. Target genes can be 
ranked and prioritized using numerical weights 
assigned to the criteria used for querying. In this 
report we describe recent updates to the TDR 
Targets database, including the addition of new 
genomes (specifically helminths), and integration 
of chemical structure, property and bioactivity 
information for biological ligands, drugs and inhibi- 
tors and cheminformatic tools for querying and 
visualizing these chemical data. These changes 
greatly facilitate exploration of linkages (both 
known and predicted) between genes and small 
molecules, yielding insight into whether particular 
proteins may be druggable, effectively allowing the 
navigation of chemical space in a genomics context. 

BACKGROUND 

The open access, web accessible TDR Targets database 
(http://tdrtargets.org) (1), allows users to interrogate 



pathogen specific genomic-scale information and to 
identify and prioritize high value targets based on 
whether or not they fulfill a set of user defined criteria. 
The name of the database includes the initialism 'TDR'
for Tropical Disease Research, a special program within 
the World Health Organization (see Acknowledgements). 
The focus of the TDR Targets database is on high priority 
tropical disease pathogens (currently the top ten patho- 
gens in the portfolio of this special program), and 
several other phylogenetically relevant pathogens 
(the Wolbachia endosymbiont of Brugia malayi and the 
apicomplexan parasite Toxoplasma gondii, for example). 
The database integrates information on gene products 
from primary genome databases (2-6), and gathers, 
from various resources and published studies, organism- 
specific functional information such as information on 
orthologues (7), 3D structures (8) and/or structural 
models [models of pathogen proteins were obtained
for this work, and are now available from ModBase (9)],
enzyme/metabolic pathway classification, expression and
essentiality (1). These datasets are further supplemented
with information curated from the literature on chemical
and/or genetic validation status of targets, precedence for
druggability and assayability. Ultimately, the genomic
scale datasets compiled in TDR Targets should aid the 
tropical infectious disease community in driving drug 
discovery efforts. The importance of this is obvious, 
given the urgent need for new drugs, the rapid emergence 
of resistance and toxicity issues associated with existing 
drugs, and considering that the drug discovery pipeline
for tropical infectious diseases is rather thin due to the 
chronic underfunding of the field and a lack of commercial 
interest from big pharmaceutical companies. 

Combining genomics data with chemical data is essential
for the success of discovery efforts. The availability
of large compound datasets is particularly important in 



*To whom correspondence should be addressed. Tel: +54 11 4580 7255 (Ext. 310); Fax: +54 11 4752 9639; Email: fernan@unsam.edu.ar, 
fernan.aguero@gmail.com 

© The Author(s) 2011. Published by Oxford University Press. 

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.



Nucleic Acids Research, 2012, Vol. 40, Database issue D1119 



tropical infectious disease research. Until recently, 
however, access to large-scale pharmacological and medi- 
cinal chemistry datasets was limited (mostly due to the 
proprietary nature of the data), or prohibitively expensive. 
This landscape started to change dramatically in
the past 5 years with the advent of a number of open
access chemical resources such as PubChem, ChEMBL
and others (10), and with the release of key high-throughput
screening datasets by both academic groups and pharmaceutical
companies (11,12) (Novartis-GNF Malaria Box, K
Gagaring, R Borboa, C Francek, Z Chen, J Buenviaje, 
D Plouffe, E Winzeler, A Brinker, T Diagana, J Taylor, 
R Glynne, A Chatterjee, K Kuhen. Genomics Institute 
of the Novartis Research Foundation (GNF) USA, and 
Novartis Institute for Tropical Disease, Singapore). 

In order to complement the genomic datasets and 
target-focused functionalities available in TDR Targets 
(1), we integrated in the database a number of chemical 
datasets enriched in drugs and drug-like molecules
collected from various sources, and developed cheminformatics
components that drive a data loading pipeline
and parts of the web application, allowing users to mine 
these data. In this report, we describe these new 
functionalities and data, and how the integration of 
chemical and genomics data can be used to formulate 
relevant queries to identify either new chemical leads for 
a target, or candidate new targets for orphan bioactive 
molecules. 

Several existing databases allow searching for target 
and/or drug information, but each of these has been 
designed for a different purpose. Some are focused pre- 
dominantly on describing chemical entities, while others 
are focused on specific aspects of proteins that make them 
possible targets, such as position in a metabolic pathway 
or structural features. Examples of databases focused on 
chemical entities include ChemBank (13), which is focused 
on small molecule activities, obtained from a wide variety 
of screenings; ChEBI (Chemical Entities of Biological 
Interest) (14), which contains chemical information and 
properties of small molecules; and DrugBank (15), which
holds chemical data on FDA approved drugs, nutraceut- 
ical and experimental drugs, together with information 
about their related proteins (targets, enzymes, etc.). 
Others like BindingDB (16) catalogue experimentally 
validated protein-ligand interactions relevant to drug
discovery and provide detailed information on binding 
affinities and inhibitory kinetics. The SuperDrug 
Database (17) allows users to search and compare struc- 
tures of approved drugs, while the Antimicrobial peptide 
database (18) also allows users to browse or search 
for peptides that target infectious agents or cancer cells 
based on their activity and content. There are several 
databases focused primarily on protein targets — most of 
these focus on a single prioritization theme, or a single 
organism. For example, the Genomic Target Database 
(19) contains lists of drug targets for four human bacterial 
pathogens, with specific lists of targets grouped by metabolic
pathway or membrane localization. The Potential
Drug Target Database (PDTD) (20) allows users to
browse solved structures of potential drug targets and to 
identify potential targets of a given small molecule 



through docking into many solved structures. The 
Therapeutic Targets Database (21) contains a large 
volume of data on known and predicted targets, 
allowing users to download target validation data for 
individual proteins, to view Quantitative Structure- 
Activity Relationship (QSAR) data for individual 
proteins, or to search targets by linked compounds. 
Finally, there are a number of cheminformatics resources 
such as HEOS (22) that facilitate the remote collaboration
between groups involved in a drug discovery process. 
However, very few of these databases offer the same 
breadth of data-types relevant to drug target characteriza- 
tion as TDR Targets. ChEMBL and PubChem are two 
examples of chemical databases that provide additional 
links to protein targets. Nonetheless, TDR Targets
stands alone in its ability to allow users to frame
weighted, complex queries to rank and interrogate many
targets at once from different species, and to cross-relate
chemical data with target data in meaningful ways, as
described in this report. In this respect we believe TDR
Targets fills an important niche. To our knowledge, the
TDR Targets database represents a unique resource
providing a comprehensive compilation of relevant data
for drug target prioritization on multiple pathogens, all
in one place.



NEW FUNCTIONALITY: INTEGRATION OF 
CHEMICAL DATA IN TDR TARGETS 

In order to collect chemical information that would be 
useful to search for new bioactive compounds against 
tropical diseases, we selected data sources enriched in 
drugs and drug-like molecules. Chemical compounds 
listed in our database come from: the ChEMBL
database (http://www.ebi.ac.uk/chembldb), which contains
information on small molecules together with bioactivity
information curated from the literature, and information
on associated protein targets (23) (443 602 compounds);
PubChem (24) (278 070 compounds); and the DrugBank
database, which contains information on FDA approved
drugs (15) (4421 compounds). The Tres Cantos
Antimalarial TCAMS dataset (GSK) (13 469 compounds)
(11), the Novartis-GNF Malaria Box database (5388 compounds),
and the St. Jude Children's Research Hospital
Malaria dataset (305 815 compounds) (12), three datasets
that contain molecules tested in high-throughput
screening assays against Plasmodium falciparum, were
obtained from ChEMBL-NTD (http://www.ebi.ac.uk/chemblntd),
a repository of screening data of molecules
directed against neglected diseases.

A total of 825 814 unique drug-like compounds 
are present in the combined dataset obtained from the 
above indicated sources. These were integrated into the 
TDR Targets database along with data on: (i) basic information
such as chemical name, InChI and InChIKey
identifiers; (ii) chemical properties (molecular weight,
logP, hydrogen bond donors, hydrogen bond acceptors,
flexible bonds, Lipinski rule of 5 compliance); (iii) chemical
structure; (iv) bioactivities (IC50, Ki, MIC, Activity, 
EC50, ED50 and percent growth inhibition, among 
Table 1. Searchable chemical information in TDR Targets

Category              Field                   Example or description
Chemical properties   Formula                 C7H10NO7P2
                      logP                    -3.23
                      No. of H donors         4
                      No. of H acceptors      8
                      No. of flexible bonds   4
Activity              Assay                   e.g. Inhibitory activity against Leishmania major farnesyl diphosphate synthase
                      Readout                 IC50, EC50, Ki, MIC, % growth inhibition, etc. (depends on particular assay)
Target association    Direct                  Manual curation (experimental evidence, target directly assayed)
                      Transitive              Experimental evidence available for an ortholog/homolog

Searchable information fields for small molecules integrated in TDR Targets are shown in the Table. The 2D structure of a molecule is used for
similarity and substructure searches, and in all searches started from the JME molecule editor. All other information fields are searchable as textual
or numeric information using standard forms (see Figure 3 for some examples). Target associations are used internally to limit search results to show
only those compounds that are associated with a target (see examples in Figure 2), and to display links to targets within pages. Note: the InChI string
in the table has been truncated for presentation purposes.




others) and (v) association with genes (curated, predicted
or any). Table 1 provides a summary of searchable fields
for chemical data. Users can access this information either
by searching for target genes, and then looking up the
chemicals linked to these genes (scenario A in Figure 1)
or by searching the chemical space itself (scenario B in 
Figure 1). 

In a target search, it is now possible to query for targets
with associated compounds and to further filter this
search based on evidence for association (i.e. curated or
predicted, or any). After running any target search, a list
of targets is shown with links to all compounds associated
with all the targets present in the list (Figure 1).

From the compound search page (http://tdrtargets.org/drugs)
users can query chemical information using either
text-based searches or structure-based searches (Figures 1
and 2) to retrieve a list of matching compounds. Users can
then choose to either view details of individual compounds
in the list (by clicking on the relevant compound) or
retrieve all genes associated with the compounds listed
using the 'show all/curated/predicted genes' link. In the
first case, a new page is shown, with a detailed view for
the compound, including basic chemical information,
synonyms, structure, association with targets (and kind
of association), activities described for the compound,
external resources, and bibliographic references. In the
second case, a new query is run to search for all/
curated/predicted pathogen genes associated with all the
compounds present in the list. The retrieved genes are
shown in similar fashion to the result for a typical target
search (Figure 1). All the history functionalities, such as
combining query results using Boolean AND, OR and
NOT operators as well as saving, exporting and publishing
query results, that were previously available for
handling target searches are also available for handling
chemical data searches.
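
The Boolean combination of stored query results can be pictured with ordinary set operations. The following is a minimal sketch only: the `combine` function, the query numbers and the compound identifiers are hypothetical, and nothing here reflects how the TDR Targets history page is actually implemented.

```python
# Hypothetical in-memory model of a query history; not the TDR Targets API.

def combine(history, left, right, op):
    """Combine two stored result sets with a Boolean operator.

    history: dict mapping a query number to a set of result identifiers.
    op: "AND" (intersection), "OR" (union) or "NOT" (left minus right).
    """
    a, b = history[left], history[right]
    if op == "AND":
        return a & b
    if op == "OR":
        return a | b
    if op == "NOT":
        return a - b
    raise ValueError(f"unsupported operator: {op}")


# Made-up result sets standing in for two saved compound queries.
history = {
    1: {"cpd_a", "cpd_b", "cpd_c"},
    2: {"cpd_b", "cpd_c", "cpd_d"},
}

print(sorted(combine(history, 1, 2, "AND")))
print(sorted(combine(history, 1, 2, "OR")))
print(sorted(combine(history, 1, 2, "NOT")))
```

Treating each saved query as a set keeps the three operators symmetric with the AND, OR and NOT options described above, and the NOT case corresponds to subtracting one result list from another.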



CONNECTING KNOWN DRUG TARGETS TO 
PATHOGEN GENOMES 

Drug repositioning or repurposing, i.e. finding a new 
indication for existing drugs or drug-like molecules, can 
greatly speed up the traditional drug discovery process, 
which typically takes more than a decade to complete 
(25). An open access chemogenomics resource such as 
TDR Targets can now provide neglected disease 
researchers and pharma companies with a basic tool for 
knowledge-based drug repositioning (in a general sense). 
A key element of such a chemogenomics approach is the 
linking of target genes to suitable chemical inhibitors,
and the leveraging of other relationships available in 
the database (e.g. sequence similarity between targets; 
chemical similarity between compounds). 

As mentioned above, TDR Targets now contains a 
chemical database of bioactive compounds and their 
targets from primary data sources such as ChEMBL, 
DrugBank and PubChem, in addition to data collected 
from high-throughput screenings (The Tres Cantos 
Antimalarial TCAMS dataset, the Novartis-GNF 
Malaria Box database, the St. Jude Children's Research 
Hospital Malaria dataset), and in-house curatorial efforts 
by the TDR Targets team. Many of the compounds 
included from the above sources into TDR Targets have 
known target genes, mostly from non-pathogens, and this 
information can be used to link the compounds to
pathogen target genes using in silico approaches such as 
orthologue mapping and protein domain conservation 
between the known targets and novel targets. In this 
Figure 1. Schematic view of chemogenomic searches and navigation supported in TDR Targets. (A) Targets search. Query #1 retrieves 170 genes
from Leishmania major, Trypanosoma cruzi and Trypanosoma brucei that were associated with compounds by manual curation. Clicking on gene
LmjF06.0860 (dihydrofolate reductase-thymidylate synthase from L. major) shows the corresponding gene page, and allows users to inspect the
associated compounds. In TDR Targets a target resultset can be used to generate the corresponding compound resultset by clicking on the 'Show
associated compounds' link (and vice versa for compound resultsets). Query #3 was generated in this way, and produces a list of 902 compounds
associated with trypanosomatid genes by manual curation. (B) Compound search (textual). Query #2 was performed from the compounds search page,
retrieving 1321 compounds that meet all 4 of Lipinski's rules, and that were associated with genes by manual curation. The combination of queries #2
and #3 (INTERSECTION) can be calculated at the history page, returning 574 compounds that meet all specified criteria. (C) Compound similarity
search. In Query #3, a 2,4-diaminoquinazoline was found associated with 2 trypanosomatid genes by manual curation. In order to find additional
related compounds with potential activity against this target, a similarity search can be performed (Query #4), retrieving another 14 compounds at
Tanimoto similarity >0.8 (chemical analogs).



way, users of the TDR Targets resource can find potential
pathogen drug targets, linked to a set of chemical compounds
with measured activity against a related protein.
From here, chemical scaffolds of the proposed compounds
can be used as starting points to identify new chemical
entities as potential drugs for the novel target. A schematic
view of this approach is presented in Figure 1.
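
The idea of carrying compound associations from a known target over to related pathogen genes can be sketched as follows. This is an illustration under stated assumptions, not the database's loading pipeline: the function name, the data layout and all identifiers (`human_T1`, `OG_1`, and so on) are hypothetical, and relatedness is reduced here to membership in the same ortholog cluster.

```python
from collections import defaultdict


def transitive_links(curated, gene_to_cluster):
    """Propagate curated compound links to cluster-mates.

    curated: dict mapping a known target to the set of compounds with
             direct (curated) evidence against it.
    gene_to_cluster: dict mapping every gene to its ortholog cluster id.
    Returns a dict mapping each gene without direct evidence to the
    compounds inherited from known targets in the same cluster.
    """
    cluster_compounds = defaultdict(set)
    for target, compounds in curated.items():
        cluster = gene_to_cluster.get(target)
        if cluster is not None:
            cluster_compounds[cluster] |= compounds

    predicted = {}
    for gene, cluster in gene_to_cluster.items():
        if gene in curated:
            continue  # already has direct evidence
        if cluster in cluster_compounds:
            predicted[gene] = set(cluster_compounds[cluster])
    return predicted


# Made-up example: one known target and two pathogen genes.
curated = {"human_T1": {"cpd_x", "cpd_y"}}
gene_to_cluster = {
    "human_T1": "OG_1",
    "pathogen_G1": "OG_1",   # same cluster, so it inherits the compounds
    "pathogen_G2": "OG_2",   # unrelated cluster, so nothing is inherited
}

print(transitive_links(curated, gene_to_cluster))
```

Keeping direct and inherited links in separate structures mirrors the distinction between curated and predicted associations used when filtering searches.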

The strategy of mapping gene functional data and 
chemical bioactivity data across orthologues is a key 
mechanism by which TDR Targets establishes links 
between chemicals and target genes. Orthologues are a 
set of genes from two or more species that originated by 
vertical descent from a single gene in the last common 
ancestor. Orthologues are often functionally and structur- 
ally similar. Thus, they may be modulated by identical 
or similar molecules, making orthology assessment a 
powerful tool to connect a known druggable target with 
a potential novel target [see (26) for an example of this 
strategy]. However, non-orthologous genes can share 
homologous druggable domains. Methods to predict 
orthology such as COG (27) and OrthoMCL (7) are 



based on a reciprocal best BLAST hit step (i.e. the first 
sequence finds the second sequence as its best hit in the 
second species, and vice versa). Protein coding genes 
containing multiple domains with one of them being 
'druggable' could pass undetected because of the recipro- 
cal best hit requirement of orthology-based methods 
(even if the other domains are conserved, but rearranged 
in the protein sequence). Thus, we implemented a second 
mapping strategy in TDR Targets using BLAST, in which 
a sequence similarity search is used to identify drug targets 
with high similarity to pathogen genes (although not 
necessarily reciprocal), with the requirement that the simi- 
larity span should almost completely cover the 'druggable' 
target (>80% coverage, E-value <10~"'). With this simple 
approach we are able to connect members of a protein 
family, which may be grouped in different ortholog 
clusters, although they are structurally similar. Using 
this strategy in the TDR Targets database, 3575 known 
druggable targets (primary drug-target association is 
based on manual curation) were assigned to 1509 
OrthoMCL clusters of orthologous genes. About 50% of 
[Figure 2 panel text, example compound page. Activities: IC50 1.3 µM, trypanocidal activity against Trypanosoma brucei rhodesiense; IC50 4 µM, inhibition of Trypanosoma brucei rhodesiense trypanothione reductase (binding); IC50 100 µM, inhibition of human glutathione reductase. Targets: curated pathogen genes from literature.]

Figure 2. A query strategy designed to find active compounds from a defined chemical class. TDR Targets allows searches of the activity of
compounds. Query #1 uses this functionality to retrieve reasonably active chemical leads that have been associated with targets. TDR Targets also
allows users to perform substructure searches. Query #2 implements such a search, retrieving compounds containing the drawn structure as part of
the molecule. The intersection of these two queries finds 37 active compounds from this class (2-iminobenzimidazoles). One example molecule from
this list is shown at the bottom, with a few selected panels of information from the corresponding compound page.



these groups (798) contain 4529 genes from tropical
disease pathogens (and are therefore candidate druggable
genes). The additional sequence similarity step performed
with BLAST allowed us to identify 2087 pathogen genes
that were not detected by strict orthology. A number of
example cases are depicted in Figure 3. In TDR Targets,
these similarity links between targets provide additional
navigation routes in the database, facilitating the assessment
of available evidence (e.g. activity of compounds)
for a related group of homologues.
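
The one-directional similarity mapping described in this section can be illustrated with a small filter over alignment records. The sketch below makes several assumptions: hits are represented as plain dictionaries with hypothetical field names, the druggable target is taken as the sequence whose coverage is measured, and the E-value cutoff is a parameter whose default is an arbitrary placeholder rather than the value used by the authors.

```python
def target_coverage(hit):
    """Fraction of the druggable target sequence spanned by the alignment."""
    return (hit["t_end"] - hit["t_start"] + 1) / hit["t_len"]


def one_way_links(hits, min_coverage=0.8, max_evalue=1e-5):
    """Return (target, pathogen_gene) pairs that pass both thresholds.

    No reciprocity is required: a hit is kept even if the pathogen gene
    would not retrieve the target as its own best hit.
    max_evalue is a placeholder default, not a value taken from the paper.
    """
    return {
        (hit["target"], hit["pathogen_gene"])
        for hit in hits
        if target_coverage(hit) > min_coverage and hit["evalue"] < max_evalue
    }


# Made-up alignment records (coordinates are on the target sequence).
hits = [
    {"target": "T1", "pathogen_gene": "G1",
     "t_start": 1, "t_end": 290, "t_len": 300, "evalue": 1e-40},
    {"target": "T1", "pathogen_gene": "G2",
     "t_start": 50, "t_end": 150, "t_len": 300, "evalue": 1e-20},
    {"target": "T2", "pathogen_gene": "G3",
     "t_start": 1, "t_end": 200, "t_len": 210, "evalue": 0.5},
]

print(one_way_links(hits))
```

In this toy input only the first hit survives: the second covers about a third of the target, and the third covers most of the target but has a weak E-value. Measuring coverage on the target rather than on the pathogen gene is what lets a longer, multi-domain pathogen protein still be linked.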



QUERYING THE AVAILABLE CHEMICAL 
INFORMATION IN TDR TARGETS 

The new chemical data available in TDR Targets can be 
queried in a number of different ways. Every compound 
entry in TDR Targets is associated with a number 
of searchable parameters such as basic information 
(name, synonyms, InChI and InChIKey identifiers, data
source), chemical properties (formula and atomic composition,
molecular weight, solubility [logP], Lipinski rule of
5 compliance), bioactivity (if applicable), target genes (if 
[Figure 3 panels: (A) S. mansoni metallocarboxypeptidase D aligned to human plasma carboxypeptidase B (43% id); (B) S. mansoni phosphatase, putative, aligned to human serine/threonine phosphatase; (C) B. malayi protein kinase aligned to human cyclin-dependent kinase 3 (46% id); (D) B. malayi peptidyl-prolyl cis-trans isomerase-13 aligned to human cyclophilin A (63% id).]

Figure 3. Links between non-homologous genes in TDR Targets.
Many druggable target sequences can be completely aligned to
pathogen genes for which no druggable orthologs can be detected
using stricter orthology methods (generally due to large differences in
protein length between homologs). The figure depicts a number of schematic
views of alignments between druggable targets (names shown in
red) and helminth genes (% id = percentage identity). Genes represented
from top to bottom (OrthoMCL Ortholog Cluster Identifier in
brackets, IC50 of most active compound in square brackets) are: (A) S.
mansoni Smp_159890 (OG4_13640), H. sapiens P15169 (OG4_27945)
[2 nM, PubChem CID 194328]; (B) Smp_155200 (OG4_12097), P62136
(OG4_10262) [0.1 nM, PubChem CID 445434]; (C) Bm1_17240
(OG4_12720), Q00526 (OG4_10184) [5 µM, PubChem CID 4369491];
(D) Bm1_52100 (OG4_12799), P62937 (OG4_10089) [2 nM, PubChem
CID 9855081].



applicable), and chemical structure (Table 1). At the top of 
the compound search page (Figure 2), the user can 
perform queries based on basic information, chemical 
properties, number and type of atoms, activities, associ- 
ation with genes and type of association (curated, 
predicted or any), and information source (Table 1). The 
InChI and InChIKey identifiers are unique representations
of a molecule that can be used to look for a specific 
compound of interest. Alternatively, chemical properties 
like the molecular weight, logP, etc., are useful to retrieve 
a collection of compounds that meet user-defined criteria. 
Users can search for desirable compounds using a 
combination of search parameters. For example, one can 
search for molecules that have a molecular weight below 
500, and 'any' described activity, or for compounds that 
are associated with genes by manual curation of literature, 
and meet all four Lipinski rules. An illustration of this 
type of chemical data search using various parameters 
is shown in Figure 3. 
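
As a rough illustration of combining property-based criteria, the sketch below filters compound records by molecular weight, by the number of Lipinski rules met, and by type of gene association. The record layout, field names and example values are all hypothetical; the thresholds are the commonly cited rule-of-five limits (molecular weight of at most 500, logP of at most 5, at most 5 hydrogen bond donors and at most 10 acceptors), and the database may apply them differently.

```python
def lipinski_rules_met(c):
    """Count how many of the four rule-of-five criteria a compound meets."""
    return sum((
        c["mol_weight"] <= 500,
        c["logp"] <= 5,
        c["h_donors"] <= 5,
        c["h_acceptors"] <= 10,
    ))


def search(compounds, max_weight=None, min_rules=None, association=None):
    """Return ids of compounds that satisfy every criterion that is set."""
    results = []
    for c in compounds:
        if max_weight is not None and not c["mol_weight"] < max_weight:
            continue
        if min_rules is not None and lipinski_rules_met(c) < min_rules:
            continue
        if association is not None and association not in c["associations"]:
            continue
        results.append(c["id"])
    return results


# Made-up records; the values do not describe real molecules.
compounds = [
    {"id": "cpd_1", "mol_weight": 320.0, "logp": 2.1,
     "h_donors": 2, "h_acceptors": 5, "associations": {"curated"}},
    {"id": "cpd_2", "mol_weight": 640.0, "logp": 6.3,
     "h_donors": 3, "h_acceptors": 12, "associations": {"predicted"}},
    {"id": "cpd_3", "mol_weight": 450.0, "logp": 4.0,
     "h_donors": 1, "h_acceptors": 6, "associations": set()},
]

print(search(compounds, max_weight=500))
print(search(compounds, min_rules=4, association="curated"))
```

Leaving each criterion optional reflects the way a search form works: only the fields the user fills in restrict the result.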

Searching for specific compounds using names or
synonyms is possible in TDR Targets; however, these
types of searches often do not produce the expected
results, in many cases due to the different names and
synonyms available for a molecule, as well as possible
alternative spellings (e.g. metrifonate versus metriphonate).
A more useful approach would be to use the 2D
chemical structure of the compound to run an 'exact
match' search to retrieve only the compound being
searched for. In TDR Targets these types of searches can
be initiated by drawing a molecule in the 'Structure-based
searches' section of the compounds search page
(http://tdrtargets.org/drugs). Users can draw molecule structures
using the JME Java applet integrated in TDR Targets
(JME molecule editor courtesy of Peter Ertl, Novartis
http://www.molinspiration.com/jme/index.html), and run
'exact', 'similarity', or 'substructure' searches. The last
type of search will find any compound that contains
the drawn structure as a part of it, and is usually a good
way to find molecules that belong to a given chemical class
(e.g. 2-iminobenzimidazoles, see Figure 2 for an example),
for example to analyze and compare their bioactivities.
Another potential use of this tool is to exclude certain
compounds from a search, for example those containing
undesirable functional groups (e.g. reactive/toxic in vivo,
metabolically unstable or known to cause problems
in screening for a particular assay). In such a case, the
user could first query the database using an initial set
of criteria (e.g. search for all compounds that have a
measured IC50 below 2 µM). Then, a second query
would retrieve all the compounds that contain an undesirable
substructure. Finally, the second query can be subtracted
from the first at the query history page, thereby
obtaining a list of compounds that meet the first criterion
(IC50 below 2 µM) but lack the undesirable substructure.
Another way of finding potentially active molecules
starting from a known bioactive compound is by similarity
searching. To measure the similarity between the molecules,
TDR Targets implements the Tanimoto (Jaccard)
association coefficient, which is a commonly used metric
for chemical similarity of small molecules (28,29).
Molecules that have a Tanimoto index equal to or
greater than 0.8 to the query molecule are retrieved.
If the molecule is too large to be easily drawn, but the
structure is available as a molfile (in SDF/MOL format),
the same searches can be initiated by pasting the content
of the molecule's molfile (text) into the corresponding
input textbox.
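
For two fingerprints A and B, the Tanimoto coefficient is the number of features they share divided by the number of features present in either, |A ∩ B| / |A ∪ B|. The sketch below applies that definition with the 0.8 cutoff mentioned above. It assumes fingerprints are already available as sets of 'on' bit positions; how the database derives fingerprints from structures is not described here, and the function names and bit sets are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) coefficient between two sets of 'on' bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def similar_compounds(query_fp, library, threshold=0.8):
    """Return names of library compounds at or above the threshold."""
    return [
        name for name, fp in library.items()
        if tanimoto(query_fp, fp) >= threshold
    ]


# Made-up fingerprints, given as sets of bit positions.
query = {1, 2, 3, 4, 5}
library = {
    "analog_a": {1, 2, 3, 4, 5},       # identical bits
    "analog_b": {1, 2, 3, 4, 6},       # 4 shared of 6 total
    "analog_c": {1, 2, 3, 4, 5, 6},    # 5 shared of 6 total
}

print(similar_compounds(query, library))
```

With these inputs `analog_b` falls below the cutoff (4/6) while `analog_c` passes (5/6), which shows how a single differing feature can move a small fingerprint across the threshold.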

Altogether, these implemented query modes allow 
users to find either individual compounds or groups of 
compounds for later inspection of the information 
associated with them, including targets. Links to other 
resources such as ChEMBL, PubChem and ChemSpider 
(30) allow users to gather additional information 
on compounds. 



USE CASES: SOME QUESTIONS THAT CAN BE 
ANSWERED USING THE NEW FUNCTIONALITIES 

The new chemical functionalities developed and the recent 
integration of chemical data allow users to answer ques- 
tions about compounds directly (e.g. in the compound 
search page), and also formulate questions about 
pathogen targets (from the targets search page, where 
the chemical information can be used as an additional 
criterion to restrict gene searches). Relevant questions
that are now possible in TDR Targets include the
following: find compounds with activity against
Leishmania that are also associated with genes based on
manual curation of the literature. This question can be
answered by performing a single query in the compounds
search page, selecting those compounds with assay
descriptions matching 'leishmania' ('Description' box in
the 'Activities' section), and choosing 'Curated' in the
'Gene Associations' section. This query results in a
list of 63 compounds that includes drugs such as
pamidronate, risedronate, artemisinine, oryzalin and
suramin.

Another example helps to illustrate how the integration 
of genomics and chemical data can be used to select 
candidate druggable targets. In this case, the following 
question can be easily translated into a search strategy 
in TDR Targets: which P. falciparum genes have
evidence of essentiality in any species (through orthology),
and have associated compounds by manual curation of
literature? The corresponding search strategy would start
by querying the database from the target search page,
first choosing the species (Plasmodium), then proceeding
to the Essentiality section of the search form and asking
for genes with 'Any evidence of essentiality in any species'.
Finally, in the same page, under Druggability, one can
further restrict the search by requesting only genes with
'Associated compounds: Curated'. Such a query would
produce a list of 39 genes. The user can then obtain
a list of the compounds associated with these genes by
clicking on 'Show curated compounds' from the results
page.



THE TDR TARGETS CURATION EFFORT 

A target-based drug discovery process is guided by the 
incremental gathering of data about a target, usually 
with the final goal of validating the target. Key data 
about the target's essentiality for the parasite, its expression
in a relevant life cycle stage, and its chemical tractability
are all available in the literature, and can be
extracted and integrated in the database using controlled
vocabularies (ontologies) that facilitate querying and
cross-relation to other database objects. As previously
described (1), the TDR Targets team has compiled extensive
literature data on phenotypic responses of targets or
whole pathogens to genetic or chemical (pharmacological)
perturbations. These literature data nicely complement
the genome-scale datasets in providing additional
target validation-related information on individual 
targets or groups of targets that have been the subject 
of more focused research. For chemicals, our structured 
representation of the effects of compounds as phenotypes 
(e.g. decreased cell growth, abnormal morphology, inhib- 
ition of catalytic activity, etc.) is distinct from others' 
(e.g. ChEMBL's) reporting of these effects as 'activities' 
(e.g. IC50's, % growth inhibition, etc.), which we have 
also imported into our database. These inconsistencies 
lead to certain challenges in querying the curated data 
(see 'Caveats' below). Nevertheless, it makes sense 
to combine our own curated datasets with others 
because there is relatively little overlap among them. 



For example, of the ~450 pathogen targets we have 
curated internally for association with compounds, only 
~20% have also been curated in the ChEMBL dataset, 
which is larger but focused mostly on non-infectious 
diseases. This limited overlap is explained in part by the 
different journals used in these curation efforts; ChEMBL 
draws mostly from medicinal chemistry journals such as 
the Journal of Medicinal Chemistry and Bioorganic & 
Medicinal Chemistry Letters, while TDR Targets draws 
heavily from pathogen-specific journals like Molecular
and Biochemical Parasitology and Antimicrobial Agents 
and Chemotherapy. The curatorial work of the TDR 
Targets team is an ongoing effort and in the future 
will include inputs solicited from the tropical infectious 
disease community. 

In the current release of TDR Targets, we incorporated 
data on curated vahdation credentials for Schistosoma 
mansoni and Trypanosoma cruzi, therefore completing 
curation for six key tropical disease causing organisms 
(species targeted for curation in previous releases of 
TDR Targets include: Mycobacterium tuberculosis, 
Trypanosoma brucei, Leishmania major and P. falcip- 
arum). The new data contains descriptions of 303 pheno- 
types (associated with 143 targets and 322 compounds), 
derived from genetic experiments (for 39 targets) 
and chemical experiments (for 113 targets). These data 
complement the data from previous releases for other 
diseases/organisms, bringing the number of targets with 
curated validation credentials to 448, and the number
of compounds curated internally to 968. 

Additionally, we have recently integrated genome-wide 
phenotyping data for T. brucei derived from the work of 
Alsford et al. (31). These data, covering different life cycle 
stages of the parasite, and from different culture condi-
tions, contributed genetic validation information for
~7400 (80%) of the annotated protein coding genes in
the genome, in the form of 'loss of fitness' or 'gain of
fitness' phenotypes. The integrated phenotype descriptions
derived from this work are a significant addition to 
TDR Targets. These correspond to ~30000 annotations, 
about half of all phenotype annotations previously 
available in the database. These annotations are now 
searchable and readily available for target prioritizations. 
Together with gene knockout datasets for M. tuberculosis 
integrated previously in TDR Targets, they represent the 
only two examples of genome-wide genetic validation 
data for WHO target organisms, which are the focus of 
TDR Targets. 



INCORPORATION OF DATA TO ASSESS 
ASSAYABILITY OF TARGETS 

For the purposes of this database, a target is considered 
assayable if it is an enzyme included in Sigma-Aldrich's
collection of assays, or if it has been assayed according to 
the BRENDA database (32). The BRENDA database 
contains categories for cloned and purified genes but not 
assayed genes per se, so to create our 'assayed' category we 
combined entries from the Km and Specific Activity 
categories, which give the clearest picture of whether a 
protein has actually been enzymatically assayed. The
mapping of the BRENDA entries to genes in TDR 
Targets was carried out as follows: (i) Mapping by 
Enzyme Commission (EC) number: EC numbers in 
BRENDA were used to map the entries to those TDR 
genes with identical EC numbers; (ii) for BRENDA 
entries where there was no match to a pathogen gene by 
EC, the gene was identified by name in the species-specific 
database (e.g. PlasmoDB) and mapped to that gene; (iii) if 
there was no gene in the species-specific database with the 
same EC or name as in the BRENDA entry, the gene was 
identified by sequence similarity using the sequence 
from the associated BRENDA literature reference as the
query. For each TDR species all entries for the genus were 
mapped; for example, genes that were assayed/purified/ 
cloned in Plasmodium knowlesi were mapped to P. falcip- 
arum and Plasmodium vivax. The source species for the 
data is specified in the 'Assayability' section on the cor-
responding gene pages.
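
The three-step mapping described above can be sketched as a simple fallback cascade. This is only an illustration under stated assumptions, not the code used to build the database: the `Gene` and `BrendaEntry` records, the `map_entry` function and the `similarity_search` callback are hypothetical names, and the sequence similarity step is left as a caller-supplied function because the text does not specify which search tool was used.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Gene:
    """Hypothetical minimal gene record for one TDR species."""
    gene_id: str
    name: str
    ec_numbers: frozenset


@dataclass(frozen=True)
class BrendaEntry:
    """Hypothetical minimal BRENDA record (e.g. a Km or Specific Activity entry)."""
    enzyme_name: str
    ec_number: str
    source_species: str  # kept so the gene page can report where the data came from


def map_entry(
    entry: BrendaEntry,
    genes: Iterable[Gene],
    similarity_search: Optional[Callable[[BrendaEntry], List[Gene]]] = None,
) -> Tuple[List[Gene], str]:
    """Return (matched genes, method), trying EC, then name, then similarity."""
    genes = list(genes)

    # (i) genes annotated with an identical EC number
    by_ec = [g for g in genes if entry.ec_number in g.ec_numbers]
    if by_ec:
        return by_ec, "ec"

    # (ii) no EC match: fall back to the gene name (case-insensitive here,
    # which is an assumption of this sketch)
    wanted = entry.enzyme_name.strip().lower()
    by_name = [g for g in genes if g.name.strip().lower() == wanted]
    if by_name:
        return by_name, "name"

    # (iii) neither EC nor name matched: delegate to a sequence similarity
    # search supplied by the caller
    if similarity_search is not None:
        hits = similarity_search(entry)
        if hits:
            return hits, "similarity"

    return [], "unmapped"
```

Returning the method alongside the genes makes it possible to record how each assayability annotation was obtained, and `source_species` carries the genus-level provenance (for example, a Plasmodium knowlesi entry mapped to a P. falciparum gene).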

Aside from having a convenient readout of activity, 
another aspect of assayability is being able to produce 
and purify a recombinant form of the protein in 
question. The largest-scale attempt to express enzymes 
from TDR Targets species has come through the 
Structural Genomics of Pathogenic Protozoa (SGPP) 
and Medical Structural Genomics of Pathogenic 
Protozoa (MSGPP) projects at the University of
Washington. Therefore each protein that has been success- 
fully expressed in recombinant form by (M)SGPP is 
annotated as such in the database, and links are 
provided for users to access additional information avail- 
able at these resources (progress in obtaining diffracting 
crystals, availability of plasmid clones, etc.).



OTHER DATA UPDATES IN TDR TARGETS: 
NEW GENOMES 

Since its launch in 2007 (1), and through several releases, 
the database has integrated a number of additional 
genomes. Most important amongst recent additions was 
the incorporation of helminth genomes (Brugia malayi, 
a nematode, the causative agent of Filariasis; and 
Schistosoma mansoni, a trematode flatworm that causes 
Schistosomiasis). Data from these organisms, including
essential annotation, were derived from GenBank (5),
SchistoDB (33) and GeneDB (3). Other genomes of 
interest for those studying parasites are T. gondii, 
the causative agent of Toxoplasmosis, sometimes used 
as a model organism for studying some aspects of 
apicomplexan biology that are more difficult to study in
e.g. malaria parasites; and P. vivax, an important human
malaria parasite which cannot yet be cultured and studied
under laboratory conditions. These genomes have also 
been integrated in TDR Targets, allowing the full 
range of operations to be performed on their targets. 

In addition to integrating the genomes of various patho-
genic organisms, TDR Targets also includes the genomes
of various phylogenetically useful species from which
different kinds of datasets can be mapped. These include
the genomes of vertebrates (human, mouse), plants
(Arabidopsis thaliana, Oryza sativa), invertebrates
(Drosophila melanogaster), nematodes (Caenorhabditis
elegans) and closely related species of pathogens already
represented in TDR Targets (e.g. Leishmania braziliensis,
L. infantum and L. mexicana). These complete proteomes
are part of the OrthoMCL database of orthologue groups (7),
and allow users to formulate questions such as: 'search
for L. major proteins that are/are not present in 
L. braziliensis', or 'find P. falciparum proteins that are 
also present in plants or bacteria' (e.g. when looking 
for apicoplast associated targets). 
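
Presence/absence questions of this kind reduce to set tests over orthologue groups. The sketch below is a minimal illustration, assuming a hypothetical in-memory layout in which each group maps a species label to the set of its member proteins; the function name, parameter names and data layout are illustrative and do not describe the TDR Targets implementation or the OrthoMCL file format.

```python
from typing import Dict, Iterable, Set

# Hypothetical layout: group id -> {species label -> protein ids in that group}
OrthologGroups = Dict[str, Dict[str, Set[str]]]


def proteins_with_profile(
    groups: OrthologGroups,
    species: str,
    present_in_any: Iterable[str] = (),
    absent_from_all: Iterable[str] = (),
) -> Set[str]:
    """Proteins of `species` whose orthologue group matches a phyletic profile.

    A group qualifies if it has a member in at least one species listed in
    `present_in_any` (when that list is non-empty) and has no member in any
    species listed in `absent_from_all`.
    """
    present_in_any = list(present_in_any)
    absent_from_all = list(absent_from_all)
    hits: Set[str] = set()
    for members in groups.values():
        own = members.get(species)
        if not own:
            continue  # the group has no protein from the query species
        if present_in_any and not any(members.get(s) for s in present_in_any):
            continue
        if any(members.get(s) for s in absent_from_all):
            continue
        hits |= own
    return hits
```

Under this layout, 'L. major proteins that are not present in L. braziliensis' would be `proteins_with_profile(groups, "L. major", absent_from_all=["L. braziliensis"])`. Proteins that belong to no orthologue group at all are not covered by this sketch and would need to be handled separately.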

CAVEATS: THINGS TO LOOK OUT FOR WHEN 
SEARCHING TDR TARGETS 

A number of important clarifications can be made to help 
users of the database. These are related to the way 
chemical information has been curated and integrated in 
the database. Knowing about these issues will help users 
make sense of search results, and understand why com- 
pounds fail to appear in a result set against all expect- 
ations. First, when curating the literature, if a 
compound has been assayed against an organism (e.g. 
for inhibition of growth of Leishmania amastigotes), this 
will be recorded by a curator, even if the activity/pheno- 
type is nil (e.g. no growth inhibition). Therefore when 
searching the database for compounds with 'any activity' 
a user is actually searching for 'compounds with any in- 
formation about their activity', as the query will return 
both active and inactive compounds. Second, the activities 
of compounds are recorded in different forms and units, as 
specified by the original authors in published papers. Even
when a significant effort is invested to standardize the in- 
formation in the database (e.g. as done by the ChEMBL 
team), the number of different ways in which activity is 
reported prevents the formulation of simple filtering 
queries such as 'show me all compounds with activity 
<5 µM'. This type of query can be easily applied to
activities reported as IC50s, but does not make sense for
activities reported as '% inhibition' (see Figure 3 for a
limited list of available activity types). In the latter case,
the concentration of the compound used to assess
inhibition is usually attached to the assay description
(e.g. '% inhibition at 1 µg/ml'), and is therefore difficult
to query separately. Finally, it is also worthwhile to note 
here that when substructure or similarity searches are run 
using a very minimal chemical scaffold or a very ubiqui-
tous chemical fragment, the number of results retrieved
can be enormous and therefore the database currently 
limits these types of searches. 
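
The second caveat can be made concrete with a small sketch. It assumes hypothetical activity records with `compound`, `type`, `value` and `unit` fields (these names are illustrative, not the database schema) and shows why a threshold such as '<5 µM' can only be applied to records that are IC50s in molar units, while everything else has to be set aside rather than silently dropped or compared.

```python
from typing import Dict, List, Tuple

# Conversion factors to micromolar for the molar units handled in this sketch.
# Mass-based units (e.g. ug/ml) would need the molecular weight and are
# deliberately treated as not comparable.
TO_MICROMOLAR = {"nM": 1e-3, "uM": 1.0, "µM": 1.0, "mM": 1e3}


def ic50_below(records: List[Dict], limit_um: float) -> Tuple[List[str], List[str]]:
    """Split records into compounds with IC50 < limit_um and non-comparable ones.

    Returns (matching, not_comparable). Compounds whose IC50 is at or above
    the limit appear in neither list.
    """
    matching: List[str] = []
    not_comparable: List[str] = []
    for rec in records:
        factor = TO_MICROMOLAR.get(rec["unit"])
        if rec["type"] != "IC50" or factor is None:
            # e.g. '% inhibition', or an IC50 given in mass units
            not_comparable.append(rec["compound"])
        elif rec["value"] * factor < limit_um:
            matching.append(rec["compound"])
    return matching, not_comparable
```

Reporting the non-comparable compounds alongside the matches reflects the first caveat as well: the absence of a compound from the filtered list does not mean it is inactive, only that its recorded activity could not be compared with the threshold.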

CONCLUSION 

The TDR Targets Database (http://tdrtargets.org) is an 
online open-access resource developed for the purpose of
facilitating target prioritization using a comprehensive
collection of data describing gene function, structure,
essentiality, assayability and druggability. An extensive
set of informatic tools facilitates mining of this resource. 
The current focus of the database is on pathogenic 
organisms that are causative agents of tropical infectious
diseases as prioritized by the World Health Organization's
Special Programme for Research and Training in Tropical
Diseases (TDR) and include M. leprae, M. tuberculosis,
P. falciparum, P. vivax, S. mansoni, T. gondii, T. brucei, 
T. cruzi, L. major, and B. malayi and its endosymbiont 
Wolbachia. Starting with release 4 of the TDR Targets 
database, chemical information has been integrated 
into TDR Targets along with cheminformatic tools to 
run chemical searches taking advantage of a variety of 
data describing chemical properties and 2D structure of 
small molecules. These new developments provide a 
chemogenomics platform that allows users to query the 
available chemical data, and investigate associations 
between targets and compounds. The web interface imple- 
mented in TDR Targets allows users to seamlessly move 
from a list of target genes to a list of compounds known to
interact with these genes and vice versa. As pointed out
above, several different databases are now available for 
browsing and searching chemical datasets and drug 
targets, but each of them focuses on only a subset of the 
functionalities available in TDR Targets, and almost none 
of them is focused on tropical diseases. TDR Targets, 
therefore, addresses a need that is somewhat neglected 
by comparable databases and resources. Initially de- 
veloped to integrate in one place data from parasitic 
genomes, evaluation of gene function, essentiality and
suitability for drug development of targets, TDR Targets 
now has extended coverage of another key component of 
the drug discovery process: chemical data including 
assays, activities, and associations with targets. 

A number of key improvements are necessary to keep 
TDR Targets useful, up to date and relevant for the 
community of scientists working on tropical diseases. 
Development of web services and other computational 
tools to facilitate reuse of data is one area that will be a
major focus in the future. Incorporating information on
the commercial availability of compounds, and providing
links to providers is another key aspect that will be 
incorporated in future releases. But more importantly 
perhaps, a sustained curation effort is also required to 
keep valuable target validation data and compound 
activity data up to date, and to identify valuable medicinal 
chemistry data for integration in TDR Targets' chemical 
database. As mentioned above, the focus of the TDR 
Targets curation effort has been largely put on the 
gathering of information on validating credentials for 
targets. However, now that a substantial investment has 
been made into the integration of compound data, 
curation should be extended to gather other supporting 
information, such as data on assays, and on the reported 
activities of compounds (in the form of IC50s, % inhib-
ition, phenotypes, etc.). Adequate funding is much
needed to sustain these activities. 